EDA and Visualisations

Overview

This project undertakes an extensive exploration of a rich dataset, focusing on a variety of graphics processing units (GPUs). We aim to analyze and visualize different facets of this data, shedding light on patterns, relationships, and trends that can provide actionable insights and contribute to better decision-making. Specifically, our exploratory data analysis revolves around the following key themes:


  1. Brand and Model Analysis: We compare the various brands and models of GPUs to understand their relative popularity and performance, and to uncover patterns specific to individual brands or models.

  2. Condition Analysis: We examine whether a GPU's condition (new or pre-owned) is related to variables such as its price and performance.

  3. Performance Analysis: We assess the performance of different GPU models and brands. Using the 'performance_score' and 'powerPerformance' variables, we aim to identify the GPUs that offer superior performance and to understand the correlation between performance and price.

By focusing on these areas, this exploratory data analysis project aspires to extract comprehensive, meaningful insights from the GPU dataset, adding significant value to stakeholders interested in the dynamics of the GPU market.

Imports and Loads

In [2]:
# Configure matplotlib for Jupyter notebook
%matplotlib inline

# Numerical libraries
import numpy as np
import pandas as pd

# Data Visualization libraries
from itertools import cycle
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.colors import ListedColormap, Normalize, to_hex
from matplotlib.font_manager import FontProperties
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Load main dataframe
file_path = r"C:\Users\Lyagovich\Documents\Portfolio\GPU Project\Cleaned_Transformed_Data.csv"
df = pd.read_csv(file_path)
In [3]:
df.head()
Out[3]:
   name                                  price   condition  brand  model                memory  powerPerformance  category  performance_score
0  ASUS NVIDIA GeForce RTX 3060 Ti 8 GB  259.99  Pre-Owned  ASUS   GeForce RTX 3060 Ti  8       101.03            Desktop   14908.0
1  XFX AMD Radeon RX 580 8 GB            79.95   Pre-Owned  XFX    Radeon RX 580        8       48.14             Desktop   8313.5
2  EVGA NVIDIA GeForce RTX 3060 Ti 8 GB  225.00  Pre-Owned  EVGA   GeForce RTX 3060 Ti  8       101.03            Desktop   14908.0
3  EVGA NVIDIA GeForce RTX 3070 8 GB     243.00  Pre-Owned  EVGA   GeForce RTX 3070     8       100.42            Desktop   15891.5
4  ZOTAC NVIDIA GeForce GTX 1060 6 GB    70.00   Pre-Owned  ZOTAC  GeForce GTX 1060     6       83.92             Desktop   8845.0

Column explanations:

  1. name: Full name of the GPU, includes brand and specific model.
  2. price: Price of the GPU in dollars.
  3. condition: Condition of the GPU (e.g., 'New', 'Pre-Owned').
  4. brand: Manufacturer of the GPU (e.g., ASUS, XFX).
  5. model: Specific model of the GPU (e.g., 'GeForce RTX 3060 Ti').
  6. memory: Memory size of the GPU, measured in GB.
  7. powerPerformance: Measure of the GPU's power performance.
  8. category: Category of the GPU (e.g., 'Desktop').
  9. performance_score: Combined score of G2D and G3D performance metrics. Calculated as (G2D * 10 + G3D) / 2.
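
To make the 'performance_score' formula concrete, here is a minimal sketch of the arithmetic. The function name and the benchmark values are hypothetical and chosen only for illustration; the notebook itself loads the score precomputed from the cleaned CSV.

```python
def performance_score(g2d: float, g3d: float) -> float:
    """Combine G2D and G3D benchmark marks as (G2D * 10 + G3D) / 2."""
    return (g2d * 10 + g3d) / 2


# Hypothetical benchmark marks, used only to illustrate the arithmetic
example = performance_score(g2d=900.0, g3d=18000.0)
print(example)  # (9000 + 18000) / 2 = 13500.0
```

Multiplying G2D by 10 brings the two marks onto a similar scale before they are averaged.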



1. Brand and Model Analysis:

Brand Frequency Distribution: A basic bar chart showing the frequency distribution of the brands in the dataset. It gives an overview of which brands are most common.

In [4]:
# Define custom color palettes
custom_palette = ["#002E29", "#004238", "#005447", "#006655", "#007764", "#008872", "#009781", "#00A690", "#00B59F", "#00C4AD"]
custom_palette_reversed = custom_palette[::-1]



#------------------ Part 1: Frequency Distribution of Brands ------------------

# Calculate brand count and normalize count values
# (rename_axis + reset_index(name=...) yields 'brand' and 'count' columns on both old and new pandas versions)
brand_count = df['brand'].value_counts().rename_axis('brand').reset_index(name='count')
brand_count['norm_count'] = 1 - (brand_count['count'] - brand_count['count'].min()) / (brand_count['count'].max() - brand_count['count'].min())

# Create the figure and the axes
fig, ax = plt.subplots(figsize=(10, 8))

# Plot using the custom color palette
sns.barplot(x='brand', y='count', data=brand_count, palette=custom_palette)

# Despine and configure labels, title, and ticks
sns.despine()
ax.set_xlabel(".", fontsize=0, labelpad=0)
ax.set_ylabel(".", fontsize=0, labelpad=0)
plt.xticks(rotation=0, ha="center", fontsize=8)
plt.yticks(fontsize=12)
plt.title('Frequency Distribution of Brands', fontsize=20, pad=20)
plt.tight_layout()
plt.show()





#------------------ Part 2: Sales Distribution by Brands ------------------

# Calculate sales count by brand and total sales count
sales_count_by_brand = df.groupby('brand')['name'].count().reset_index().rename(columns={'name': 'Sales Count'})
total_sales = sales_count_by_brand['Sales Count'].sum()

# Calculate and format the percentage
sales_count_by_brand['Percentage'] = ((sales_count_by_brand['Sales Count'] / total_sales) * 100).round(2).astype(str) + '%'

# Sort the DataFrame by sales count in descending order and normalize 'Sales Count'
sales_count_by_brand = sales_count_by_brand.sort_values('Sales Count', ascending=False)
norm = Normalize(sales_count_by_brand['Sales Count'].min(), sales_count_by_brand['Sales Count'].max())  # Normalization range

# Create a color map using the reversed custom palette
cmap = ListedColormap(custom_palette_reversed)

# Map normalized 'Sales Count' values to the color map
colors = [to_hex(cmap(norm(x))) for x in sales_count_by_brand['Sales Count']]

# Create the Plotly Table
fig = go.Figure(data=[go.Table(
    header=dict(values=list(sales_count_by_brand.columns),
                fill_color='#000000',
                align='left',
                font=dict(color='white', size=13)),
    cells=dict(values=[sales_count_by_brand.brand, sales_count_by_brand['Sales Count'], sales_count_by_brand.Percentage],
               fill_color=[colors]*3,
               font=dict(color='white', size=12),
               align='left',
               height=35))  # Increase the cell height here
])



# Adjust the layout
fig.update_layout(width=975, height=575)
fig.show()

Market Domination: The top 5 brands (EVGA, ASUS, MSI, GIGABYTE, NVIDIA) dominate the market, comprising nearly 80% of all sales. This indicates that market share is highly concentrated in a small number of brands.

EVGA's Strong Position: EVGA, in particular, stands out with nearly 24% of total sales, showing its strong position in the market.

Significant Drop after Top 5: There is a significant drop in sales percentages after the top 5 brands. The sixth brand, ZOTAC, holds only about 7.75% of the market, roughly half the share of fifth-ranked NVIDIA. This reveals a substantial gap between the top-tier and second-tier brands.
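
The concentration figures above can be reproduced with a cumulative share calculation. This is a minimal sketch on hypothetical toy data (the `toy` frame and its brand counts are made up); in the notebook the same calls would be applied to `df['brand']`.

```python
import pandas as pd

# Hypothetical listings; the real notebook would use df['brand'] instead
toy = pd.DataFrame({"brand": ["EVGA"] * 5 + ["ASUS"] * 3 + ["MSI"] * 1 + ["ZOTAC"] * 1})

# Share of listings per brand, sorted from largest to smallest
share = toy["brand"].value_counts(normalize=True)

# Cumulative share shows how quickly the top brands add up
cumulative = share.cumsum()

# Combined share of the two largest brands
top2_share = share.head(2).sum()
print(share)
print(cumulative)
print(f"Top 2 brands hold {top2_share:.0%} of listings")
```

Reading the cumulative column from the top gives the combined share of the top N brands directly.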




In [5]:
# Define custom color palette
custom_palette = ["#002E29", "#004238", "#005447", "#006655", "#007764", "#008872", "#009781", "#00A690", "#00B59F", "#00C4AD"]

# Get the top 10 GPU models by frequency
top_10_models = df.model.value_counts().nlargest(10)

# Create the figure and the axes
fig, ax = plt.subplots(figsize=(11, 8))

# Plot using the custom color palette
sns.barplot(y=top_10_models.index, x=top_10_models.values, palette=custom_palette, ax=ax)

# Despine
sns.despine(left=False, bottom=True, right=True, top=False)

# Add frequency counts at the end of the bars
for patch in ax.patches:
    ax.text(patch.get_width() + 2.9, patch.get_y() + .5, 
            str(int(patch.get_width())), 
            fontsize=11, color='Grey')

# Set labels and title
ax.set_xlabel(".", fontsize=0)
ax.set_ylabel(".", fontsize=0)
ax.xaxis.tick_top()  # Move x-axis labels to top
ax.xaxis.set_label_position('top')  # Move x-axis title to top

plt.title('Top 10 GPU Models by Sales', fontsize=20, pad=30)  # Add padding to the title
plt.show()

This plot represents the top 10 GPU models by frequency of occurrence in the dataset. The model 'GeForce RTX 3080' appears most frequently with a count of 494, followed closely by 'GeForce RTX 3070' with a count of 456. On the other end of the spectrum, 'GeForce RTX 2060' appears least frequently amongst the top 10 models, with a count of 165. It's also noteworthy that the list includes both GeForce RTX and Radeon RX models, indicating a diverse set of GPU models in use.




Price Distribution by Brand: A boxplot showing the price distribution for each brand, with brands ordered by mean price. It also shows outliers, which could be very expensive models.

In [6]:
# Define custom color palette
custom_palette = ["#002E29", "#004238", "#005447", "#006655", "#007764", "#008872", "#009781", "#00A690", "#00B59F", "#00C4AD"]

def plot_price_distribution(df, show_outliers=True, y_limit=None):
    # Calculate mean price by brand
    mean_price_by_brand = df.groupby('brand')['price'].mean().sort_values(ascending=False)
    brand_order = mean_price_by_brand.index

    # Create the figure and the axes
    fig, ax = plt.subplots(figsize=(11, 8))

    # Plot
    sns.boxplot(x='brand', y='price', data=df, order=brand_order, ax=ax, palette=custom_palette, showfliers=show_outliers)

    # Despine
    sns.despine(left=False, bottom=False, right=True, top=True)

    # Set the y-axis limit if defined
    if y_limit:
        ax.set_ylim(0, y_limit)

    # Improve readability of the plot
    plt.xticks(rotation=0, ha='center', fontsize=8)
    ax.set_xlabel(".", fontsize=0)
    ax.set_ylabel(".", fontsize=0)

    # Add a title
    title = 'Price Distribution by Brand (Excluding Outliers)' if not show_outliers else 'Price Distribution by Brand'
    plt.title(title, fontsize=20)

    # Format y-axis labels with dollar sign
    formatter = ticker.FormatStrFormatter('$%1.0f')
    ax.yaxis.set_major_formatter(formatter)

    # Adjust the layout
    plt.tight_layout()

    # Show the plot
    plt.show()

# Call the function
plot_price_distribution(df)
plot_price_distribution(df, show_outliers=False, y_limit=850)

Key Insight: Price Variation and Bargain Opportunities: The standard deviation (std) reveals the price variability within each brand. Higher standard deviations indicate wider price ranges for graphics cards within a brand, creating opportunities for buyers to find bargains or deals. By monitoring prices and comparing listings, buyers can identify relatively lower-priced options or take advantage of price fluctuations within specific brands.

  • Brand Reputation and Performance: Higher-priced brands like EVGA, GIGABYTE, and NVIDIA are known for delivering high-performance graphics cards due to advanced technologies and superior build quality.

  • Brand Loyalty and Resale Value: Brands like EVGA and ASUS often enjoy customer loyalty and have a reputation for better resale value over time.

We will later need to look into the price variation within the model set to gauge the brand value.
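
The per-brand standard deviation mentioned above is not drawn by the boxplot itself, so it has to be computed separately. A minimal sketch, assuming hypothetical brands "A" and "B" with made-up prices; in the notebook the same aggregation would run on `df`.

```python
import pandas as pd

# Hypothetical prices; the real notebook would use the loaded df
toy = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B", "B"],
    "price": [100.0, 200.0, 300.0, 190.0, 200.0, 210.0],
})

# Mean, sample standard deviation and listing count per brand,
# sorted so the brand with the widest price spread comes first
price_stats = (
    toy.groupby("brand")["price"]
    .agg(["mean", "std", "count"])
    .sort_values("std", ascending=False)
)
print(price_stats)
```

Both toy brands have the same mean price, but brand "A" has a much larger spread, which is the situation the bargain-hunting argument relies on.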




Price vs Performance by Brand: A scatterplot to show the relationship between price and performance, color-coded by brand. This would reveal if more expensive GPUs have better performance and if there are any brand-specific trends.

Price vs Performance by Brand | Scatter Plot

In [7]:
# Define color palette
color_palette = ["#2F7ABF", "#6699BB", "#66B2B2", "#66B2B2", "#D9D4C0", "#33591E", "#236B99", "#003c32", "#008872"]

# Create color cycler
colors = cycle(color_palette)

fig = go.Figure()
brands = df['brand'].unique()

for brand in brands:
    brand_data = df[df['brand'] == brand]
    fig.add_trace(go.Scatter(
        x=brand_data['price'],
        y=brand_data['performance_score'],
        mode='markers',
        name=brand,
        marker=dict(color=next(colors), size=8, opacity=0.3),
        hoverinfo='text',
        text=brand_data.apply(lambda row: f"Brand: {row['brand']}<br>Performance Score: {row['performance_score']}<br>Price: {row['price']}", axis=1)
    ))

fig.update_layout(
    title='Price vs Performance by Brand | Scatter Plot',
    xaxis_title='Price',
    yaxis_title='Performance Score',
    legend=dict(orientation="v", yanchor="bottom", y=0.05, xanchor="right", x=0.95),
    plot_bgcolor='white',
    height=800
)

fig.show()

The scatter plot illustrates the relationship between the price and performance score of graphics cards from various brands on eBay.

  • Higher prices tend to correspond to higher performance scores, indicating a positive correlation.

  • Outliers in the plot indicate the presence of graphics cards that are either significantly underpriced or overpriced relative to their performance.

  • The horizontal rows of points in the plot represent different purchase prices for the same graphics card model (every listing of a model shares one performance score), revealing how widely the price paid for identical hardware varies.

  • The graph suggests a point of diminishing returns, where the incremental performance improvement diminishes as graphics card prices increase.


Price vs Performance by Brand | Polynomial Regression Plot

In [8]:
# Polynomial regression plot. Very condensed code. Not meant to be readable.
brands, brand_colors = df['brand'].unique().tolist(), ["#2F7ABF", "#6699BB", "#66B2B2", "#66B2B2", "#D9D4C0", "#33591E", "#236B99", "#003c32", "#008872"][:len(df['brand'].unique().tolist())]
traces = [go.Scatter(x=np.linspace((bd:=df[df['brand'] == b])['price'].min(), bd['price'].max(), 100), y=np.poly1d(np.polyfit(bd['price'], bd['performance_score'], 4))(np.linspace(bd['price'].min(), bd['price'].max(), 100)), mode='lines', name=b, line=dict(color=color, width=2)) for b, color in zip(brands, brand_colors)]
buttons = [dict(label="All", method="update", args=[{"visible": [True]*len(brands)}], args2=[{"visible": [False]*len(brands)}])]+[dict(label=f'<span style="color:{color}">&#9679;</span> {b}', method="update", args=[{"visible": [True if j == i else False for j in range(len(brands))]}], args2=[{"visible": [False if j == i else True for j in range(len(brands))]}]) for i, (b, color) in enumerate(zip(brands, brand_colors))]
go.Figure(data=traces, layout=go.Layout(title='Price vs Performance by Brand |  Polynomial Regression Plot', xaxis=dict(title='Price'), yaxis=dict(title='Performance Score'), plot_bgcolor='white', height=800, margin=dict(t=100))).update_traces(marker=dict(size=0)).update_layout(updatemenus=[dict(type="buttons", buttons=buttons, direction="down", showactive=True, x=1.05, y=0.95, xanchor="left", yanchor="top", bgcolor='rgba(0,0,0,0)', bordercolor='rgba(0,0,0,0)', font=dict(color='black'))], showlegend=False).show()

Insights from the Regression Plot

This regression plot is one of the most insightful graphs in the entire workbook. It sheds light on the relationship between price and performance score of graphics cards from multiple brands on eBay.

Key Takeaway:

  • The more popular brands exhibit a stronger correlation between performance and price compared to less popular brands.

Additional Points:

  • Beyond the $600 price point, performance gains diminish as price increases. This could be attributed to graphics cards being bought as collectible items, or to buyers weighing factors other than the primary utility of the card.

  • Among all the brands, XFX provides the best value at around $475 but the worst value at around $275.

  • PowerColor has found its niche in selling the highest performing cards, positioning itself as a brand known for top-tier performance.

  • ASUS, while slightly lagging behind the curve, achieves success in the market through factors other than the price-to-performance ratio.

  • Popular brands demonstrate greater consistency in their value proposition at each price level.
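
The curves discussed above come from a degree-4 polynomial fit of performance score on price, computed per brand with `np.polyfit`. Since the notebook cell is deliberately condensed, here is a more readable sketch of that single fitting step. The helper name `fit_price_performance` and the synthetic data are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np


def fit_price_performance(price, score, degree=4, n_points=100):
    """Fit score as a polynomial in price and evaluate it on an even price grid."""
    price = np.asarray(price, dtype=float)
    score = np.asarray(score, dtype=float)
    coeffs = np.polyfit(price, score, degree)  # least-squares coefficients
    grid = np.linspace(price.min(), price.max(), n_points)
    return grid, np.poly1d(coeffs)(grid)


# Hypothetical data with diminishing returns, for illustration only
rng = np.random.default_rng(0)
toy_price = np.linspace(50, 900, 60)
toy_score = 4000 * np.log(toy_price) + rng.normal(0, 200, toy_price.size)

grid, fitted = fit_price_performance(toy_price, toy_score)
print(grid[:3], fitted[:3])
```

In the notebook, this step would be repeated for each brand's subset of `df`, with each `(grid, fitted)` pair drawn as one line. High-degree polynomials can swing sharply near the ends of the price range, so the extremes of each curve are best read with caution.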


In [9]:
# First, we aggregate our data by 'brand' and 'model', 
# and compute the mean 'price' and count of 'name' (which we'll rename to 'sales_count').
agg_df = df.groupby(['brand', 'model']).agg({'price': 'mean', 'name': 'count'}).reset_index()
agg_df.rename(columns={'name': 'sales_count'}, inplace=True)

# Now we'll create a treemap with a custom color scheme.
# We specify the hierarchy with the 'path' parameter, 'values' for sizing, and 'color' for the color scale.
# We also add custom hover data and labels.
fig = px.treemap(agg_df, 
                 path=['brand', 'model'], 
                 values='sales_count', 
                 color='price',
                 color_continuous_scale=["#DCEBF7", "#CDDFF3", "#BECFF0", "#AFC3ED", "#A0B7E9", "#91ABE6", "#82A0E2", "#7394DF", "#6488DB", "#557CD8"],
                 hover_data={'brand': True, 'model': True, 'sales_count': ':,', 'price': ':,.2f'},
                 labels={'brand': 'Brand', 'model': 'Model', 'sales_count': 'Count', 'price': 'Average Price'},
                 title='Sales Count and Average Price by Brand and Model')

# We update the hovertemplate to remove 'id' from hover labels.
fig.update_traces(hovertemplate='<b>%{label}</b>: %{customdata} <extra></extra>')

# Next, we set the figure size.
fig.update_layout(width=960, height=780)

# We remove the margin on the right side.
fig.update_layout(margin=dict(r=0))

# Finally, we display our treemap.
fig.show()

This visualization makes it easier to see both the sales by brand and the sales by model within each brand.




2. Condition Analysis:

1. Bar Chart of GPU Count by Condition

In [10]:
# We are excluding the 'Very Good - Refurbished' condition from our data
condition_count = df[df['condition'] != 'Very Good - Refurbished']['condition'].value_counts().reset_index()
condition_count.columns = ['condition', 'count']

# Creating the figure and the axes
fig, ax = plt.subplots(figsize=(11, 8))

# Defining the custom color palette
custom_palette = ["#2B5F57", "#3B7E68", "#4C9D79", "#5CAC89", "#6BBB9A", "#7AC9AA", "#8AD8BB", "#99E6CB", "#A8F5DC", "#B8F4EC"]

# Creating the bar plot with the custom color palette
sns.barplot(x='condition', y='count', data=condition_count, palette=custom_palette)

# Removing the spines and grid for cleaner look
sns.despine()
ax.grid(False)

# Setting the labels and the title with custom fontsizes and padding for the title
ax.set_xlabel(".", fontsize=0, labelpad=0)
ax.set_ylabel(".", fontsize=0, labelpad=0)
plt.xticks(rotation=0, ha="center", fontsize=12)
plt.yticks(fontsize=12)

# Setting the title color to black
plt.title('Count of GPUs by Condition', fontsize=20, pad=20, color='black')

# Adjusting the layout and showing the plot
plt.tight_layout()
plt.show()

The vast majority of GPU sales on eBay come from used cards.

Note: I exclude the "Very Good - Refurbished" category because it contains only one listing.




2. Boxplot of Prices by Condition

In [11]:
# First, we filter out rows with 'Very Good - Refurbished' condition
df_filtered = df[df['condition'] != 'Very Good - Refurbished']

# Define the custom color palette
custom_palette = ["#002E29", "#004238", "#005447", "#006655", "#007764", "#008872", "#009781", "#00A690", "#00B59F", "#00C4AD"]

# Setup the figure with a specific size
plt.figure(figsize=(11, 8))

# Create a boxplot, passing in the filtered DataFrame and the custom color palette
sns.boxplot(x='condition', y='price', data=df_filtered, palette=custom_palette)

# Remove the top and right spines from plot for a cleaner look
sns.despine()

# Remove grid
plt.grid(False)

# Add title and labels. Use '.' and fontsize=0 to hide x and y labels.
plt.title('GPU Prices by Condition', fontsize=18, color='black')
plt.xlabel('.', fontsize=0)
plt.ylabel('.', fontsize=0)

# Adjust subplot params so that the subplot fits into the figure area.
plt.tight_layout()

# Display the plot
plt.show()

Brand New graphics cards have a wider price range compared to Open Box and Refurbished cards, indicating that there is a greater variation in pricing for new products.

Despite being classified as Pre-Owned, there is a notable number of outliers among the prices of Pre-Owned graphics cards, suggesting that some pre-owned cards are listed at significantly higher prices, potentially due to their rarity or unique features.

The standard deviation of prices is highest for Brand New graphics cards, indicating a wider spread of prices compared to other conditions. This implies that there can be a substantial price difference among Brand New cards, even within the same condition category.

The count of Refurbished graphics cards is relatively low compared to the other condition categories, potentially indicating a smaller market presence or limited availability of refurbished options on eBay for graphics cards.




3. Scatter Plot of Performance Score vs Price, Colored by Condition

In [12]:
#------------------ Part 1: Scatter plot ------------------

# First, we filter out rows with 'Very Good - Refurbished' condition
df_filtered = df[df['condition'] != 'Very Good - Refurbished']

# Setup the figure with a specific size
plt.figure(figsize=(11, 8))

# Create a scatter plot of performance score vs price, colored by condition
# (the previous version of this cell repeated the boxplot from the cell above)
sns.scatterplot(x='price', y='performance_score', hue='condition', data=df_filtered, alpha=0.4)

# Remove the top and right spines from plot for a cleaner look
sns.despine()

# Remove grid
plt.grid(False)

# Add title and axis labels
plt.title('Performance Score vs Price by Condition', fontsize=18, color='black')
plt.xlabel('Price')
plt.ylabel('Performance Score')

# Adjust subplot params so that the subplot fits into the figure area.
plt.tight_layout()

# Display the plot
plt.show()



#------------------ Part 2: Polynomial regression plot ------------------


# Polynomial regression plot. Very condensed code. Not meant to be readable.
conditions, condition_colors = df_filtered['condition'].unique().tolist(), ["#138b81", "#8b1359", "#45138b", "#0000CD"]
traces = [go.Scatter(y=np.linspace((cd:=df_filtered[df_filtered['condition'] == c])['performance_score'].min(), cd['performance_score'].max(), 100), x=np.poly1d(np.polyfit(cd['performance_score'], cd['price'], 4))(np.linspace(cd['performance_score'].min(), cd['performance_score'].max(), 100)), mode='lines', name=c, line=dict(color=color, width=2)) for c, color in zip(conditions, condition_colors)]
buttons = [dict(label="All", method="update", args=[{"visible": [True]*len(conditions)}], args2=[{"visible": [False]*len(conditions)}])]+[dict(label=f'<span style="color:{color}">&#9679;</span> {c}', method="update", args=[{"visible": [True if j == i else False for j in range(len(conditions))]}], args2=[{"visible": [False if j == i else True for j in range(len(conditions))]}]) for i, (c, color) in enumerate(zip(conditions, condition_colors))]
go.Figure(data=traces, layout=go.Layout(title='Performance vs Price by Condition | Polynomial Regression Plot', yaxis=dict(title='Performance Score', linecolor='black', showline=True, linewidth=2, ticks="outside", tickcolor='black', title_font=dict(size=18, color='white')),xaxis=dict(title='$ Price', linecolor='black', showline=True, linewidth=2, ticks="outside", tickcolor='black', title_font=dict(size=18, color='white')), plot_bgcolor='white', height=800, margin=dict(t=100),font=dict(family="Courier New, monospace",size=18,  color='black' ),title_font=dict(size=18,  color='black' ),paper_bgcolor='white')).update_traces(marker=dict(size=0)).update_layout(updatemenus=[dict(type="buttons", buttons=buttons, direction="down", showactive=True, x=1.05, y=0.95, xanchor="left", yanchor="top", bgcolor='rgba(0,0,0,0)', bordercolor='rgba(0,0,0,0)', font=dict(color='black'))], showlegend=False).show()

  • Performance score rises rapidly with price up to roughly the $200 to $250 range. This indicates a high return on investment in terms of performance in this price range.

  • Beyond the $250 price point, the performance score continues to correlate positively with price, but the rate of increase is slower. This suggests diminishing returns on investment beyond this price range.

  • In terms of performance per dollar spent, brand-new items offer the least value, whereas pre-owned items provide the highest value. This can be inferred from their respective regression curves' positions and the associated scatter points.

  • There is a region of overlap in value around the $100 mark, indicating similar performance per dollar across different conditions in this price range.
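
The performance-per-dollar comparison above is read off the regression curves, but it can also be checked numerically. A minimal sketch on hypothetical listings (the prices and scores below are made up, and `score_per_dollar` is a helper column introduced here, not a column of the dataset):

```python
import pandas as pd

# Hypothetical listings; condition labels mirror those used in the notebook
toy = pd.DataFrame({
    "condition": ["Brand New", "Brand New", "Pre-Owned", "Pre-Owned"],
    "price": [500.0, 400.0, 250.0, 200.0],
    "performance_score": [16000.0, 14000.0, 15000.0, 13000.0],
})

# Performance points obtained per dollar for each listing
toy["score_per_dollar"] = toy["performance_score"] / toy["price"]

# The median is used so a few extreme listings do not dominate the comparison
value_by_condition = (
    toy.groupby("condition")["score_per_dollar"]
    .median()
    .sort_values(ascending=False)
)
print(value_by_condition)
```

Applied to `df_filtered`, the same grouping would give one value figure per condition that can be compared against what the curves suggest.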




3. Performance Analysis

The following charts are a bit more complicated and are made largely for fun 😊


1. Price vs Performance | Hexbin Marginals Plot

In [13]:
# Set the global font to be 'Verdana'
plt.rcParams['font.family'] = 'Verdana'

# Create a hexbin plot with a custom color palette and grid size
g = sns.jointplot(
    data=df_filtered, 
    x="price", 
    y="performance_score", 
    kind="hex", 
    color="#a1c9c9", 
    height=10, 
    space=0.2, 
    gridsize=20
)

# Set the title of the plot
g.fig.suptitle('Price vs Performance | Hexbin Plot', fontsize=16, color='black')

# Set the labels of the x and y axes
g.ax_joint.set_xlabel('Price', fontsize=10, color='black')
g.ax_joint.set_ylabel('Performance Score', fontsize=10, color='black')

# Add padding to the labels of the x and y axes
g.ax_joint.xaxis.labelpad = 5
g.ax_joint.yaxis.labelpad = 5

# Set the color of the tick parameters
g.ax_joint.tick_params(colors='black', grid_alpha=0)

# Set the color of the figure (paper) background to white
g.fig.set_facecolor('white')

# Set the color of the plot (axes) background to white
g.ax_joint.set_facecolor('white')

# Adjust the space at the top of the figure for the title
plt.subplots_adjust(top=0.9)

# Display the plot
plt.show()

The hexbin plot visually depicts the relationship between price and performance. Each hexagon represents a group of items, with darker hexagons indicating a higher concentration of items. It is observed that as price increases, performance generally tends to increase, although the relationship is not linear. Notably, there are clusters of dark hexagons at lower prices, suggesting that certain items offer high performance at a more affordable cost.

The histograms displayed on the top and right sides of the hexbin plot provide insights into the distribution of prices and performance scores individually. The top histogram illustrates the frequency of different price values, while the right histogram portrays the frequency of various performance score values.

These histograms enhance our understanding of the distribution patterns and ranges for each variable. They enable us to identify common price ranges, performance score distributions, and potential outliers. In essence, the histograms complement the hexbin plot by providing a closer examination of the individual variables, shedding light on any noteworthy trends or patterns.

Based on the plot, the combination of a performance score of approximately 16,000 and a price of about $300 appears to be particularly appealing to a significant number of buyers, as indicated by the concentration of dark hexagons in that region.

2. Price vs Performance | Marginal Boxplot

In [14]:
# Create figure and axes
fig = plt.figure(figsize=(12, 9))
grid = plt.GridSpec(7, 7, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_bottom = fig.add_subplot(grid[-1, :-1], xticklabels=[], yticklabels=[])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])

# Scatter plot on main axis
sns.scatterplot(data=df, x='price', y='performance_score', ax=ax_main, color='#005f65')

# Boxplot on the bottom axis
sns.boxplot(data=df, x='price', ax=ax_bottom, color='#00777f')
ax_bottom.set(xlabel='')
ax_bottom.xaxis.set_ticks([])

# Boxplot on the right axis (a vertical box needs only `y`; the earlier orient='h' was ignored with a warning)
sns.boxplot(data=df, y='performance_score', ax=ax_right, color='#00777f')
ax_right.set(ylabel='')
ax_right.yaxis.set_ticks([])

# Set the title for main axis and remove labels
ax_main.set(title='Price vs Performance', xlabel='', ylabel='')

# Modify the title color, size, and position
ax_main.title.set_color('black')
ax_main.title.set_fontsize(14)
ax_main.title.set_position([.5, 1.05])

# Remove the box around all axes
sns.despine(ax=ax_main, left=False, bottom=False)
sns.despine(ax=ax_right, left=True, bottom=True, right=True)
sns.despine(ax=ax_bottom, left=True, bottom=True, top=True)

# Adjust layout for proper spacing
plt.subplots_adjust(wspace=0.1, hspace=0.1)

# Show the plot
plt.show()

This plot is designed to give a quick visual overview of both the individual data points (using a scatter plot) and the overall distribution of the data (using box plots).

The scatter plot in the center shows individual data points with the 'price' on the x-axis and the 'performance_score' on the y-axis. Each dot represents a single GPU and its position indicates its price and performance.

The box plots, on the other hand, provide a summary of the distribution of the 'price' and 'performance_score'. The boxes represent the interquartile range (IQR), which is the range between the 25th percentile (the lower edge of the box) and the 75th percentile (the upper edge). The line inside the box indicates the median, or the 50th percentile. The lines extending from the boxes, called whiskers, show the range of the data, while outliers may be represented as dots beyond the whiskers.

  1. The average (mean) price of the GPUs is approximately 259.28, and the average performance score is approximately 13617.23.

  2. There is quite a bit of variability in the prices and performance scores of GPUs, as indicated by the standard deviation values. The standard deviation for the price is about 171.29, and for the performance score, it is approximately 3305.79.

  3. The median price of the GPUs is 232.50, and the median performance score is 13812.00. In other words, about half the GPUs are priced at or below 232.50, and (separately) about half have a performance score at or below 13812.00.

  4. There is a very large number of outliers in prices.
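
The outliers drawn by the boxplots follow the 1.5 × IQR whisker rule that matplotlib and seaborn use by default. To count them rather than eyeball them, a sketch along these lines could be used. The helper `iqr_outlier_mask` and the toy prices are hypothetical; in the notebook the input would be `df['price']`.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR beyond the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical prices; the real notebook would pass df['price']
toy_prices = np.array([100, 150, 200, 220, 240, 260, 300, 1500], dtype=float)
mask = iqr_outlier_mask(toy_prices)

print("mean:", toy_prices.mean(), "median:", np.median(toy_prices))
print("outliers:", toy_prices[mask])
```

The toy data also shows why the mean sits above the median when a few listings are very expensive, which is the same pattern as the 259.28 mean versus 232.50 median reported above.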

In [15]:
# Select the columns related to performance
performance_cols = ['performance_score', 'powerPerformance', 'price', 'memory']

# Compute the correlation matrix
correlation_matrix = df[performance_cols].corr()

# Create the heatmap plot
fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,  # Correlation values
    x=correlation_matrix.columns,  # X-axis labels
    y=correlation_matrix.columns,  # Y-axis labels
    colorscale=[[0, '#005f66'], [1, '#e7fbff']],  # Use a monochromatic colorscale
    reversescale=True,  # Reverse the color scale
    colorbar=dict(title='Correlation'),  # Add colorbar title
))

# Add text annotations on top of the heatmap squares
for i in range(len(correlation_matrix.columns)):
    for j in range(len(correlation_matrix.columns)):
        fig.add_annotation(
            x=correlation_matrix.columns[i],
            y=correlation_matrix.columns[j],
            text=f"{correlation_matrix.iloc[j, i]:.2f}",  # Format the correlation value
            showarrow=False,
            font=dict(color='white' if correlation_matrix.iloc[j, i] > 0.5 else 'black'),  # Customize text color
            xref='x',
            yref='y',
        )

# Update layout
fig.update_layout(
    width=975, height=800,  # Set the figure size
    title='Correlation Matrix of Variables Related to Performance',  # Add title at the top
    xaxis=dict(title='', showticklabels=True, ticks='outside', tickfont=dict(size=10)),  # Customize x-axis
    yaxis=dict(title='', showticklabels=True, ticks='outside', tickfont=dict(size=10)),  # Customize y-axis
    plot_bgcolor='white',  # Set the background of the plotting area to white
    margin=dict(t=50, b=50, l=50, r=50),  # Add margins
    paper_bgcolor='white',  # Set the background of the surrounding paper to white
)

# Show the plot
fig.show()

This plot is a heatmap of the correlation matrix. A correlation matrix is a table that shows the correlation coefficients between different variables. Each cell in the table shows the correlation between two variables.

  1. 'performance_score' and 'price' correlation (0.82): This is the strongest relationship in the matrix, meaning the performance score of a product is strongly related to its price. As the performance score increases, the price tends to increase as well.

  2. 'performance_score' and 'memory' correlation (0.65): There's a moderate-to-strong positive correlation, which implies that products with higher performance scores also tend to have more memory.

  3. 'price' and 'memory' correlation (0.60): There's a moderate positive correlation. More expensive products tend to have more memory, although this link is weaker than the one between price and performance score.

  4. 'powerPerformance' and the other three variables: The correlations here are quite weak. The strongest is with 'performance_score' at 0.19, which suggests only a small positive relationship. Interestingly, 'powerPerformance' has virtually no correlation with 'memory' (-0.003), meaning there is no linear relationship between these two variables in this dataset.

The key takeaway here is that 'performance_score' is the variable most strongly correlated with the others in this dataset, showing strong correlations with 'price' and 'memory'. 'powerPerformance', on the other hand, doesn't appear to have a strong linear relationship with any of the other variables. This could suggest that 'powerPerformance' is not a significant factor in the pricing or memory capacity of the products, although correlation alone cannot establish cause.
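
To read the strongest relationships off a correlation matrix without scanning the heatmap, each pair of variables can be listed once and sorted by absolute correlation. The sketch below is a minimal illustration on synthetic data: the helper `ranked_correlations` and the generated values are assumptions, and only the column names mirror the dataset. Passing `method='spearman'` gives a rank-based alternative that is less sensitive to the price outliers noted earlier.

```python
import numpy as np
import pandas as pd


def ranked_correlations(frame: pd.DataFrame, method: str = "pearson") -> pd.Series:
    """List each variable pair once, sorted by absolute correlation."""
    corr = frame.corr(method=method)
    # Keep only the upper triangle; k=1 excludes the diagonal of 1.0 values
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic data for illustration only; not the GPU dataset
rng = np.random.default_rng(0)
score = rng.normal(14000, 3000, 200)
demo = pd.DataFrame({
    "performance_score": score,
    "price": 0.02 * score + rng.normal(0, 20, 200),  # built to track the score
    "powerPerformance": rng.normal(0, 1, 200),       # built to be unrelated
})

ranked = ranked_correlations(demo)
print(ranked.round(2))
```

In the notebook, the same helper could be called as `ranked_correlations(df[performance_cols])`.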

In [16]:
# Create the histogram plot using Plotly
fig = px.histogram(df, x='performance_score', nbins=20, color_discrete_sequence=['#005f65'])

# Update layout
fig.update_layout(
    width=975, height=800,  # Set the figure size
    title='Performance Score Distribution | Histogram',  # Add title at the top
    xaxis=dict(title='', showticklabels=True, ticks='outside', tickfont=dict(size=10), showline=True, linewidth=1, linecolor='black'),  # Customize x-axis
    yaxis=dict(title='', showticklabels=True, ticks='outside', tickfont=dict(size=10), showline=True, linewidth=1, linecolor='black'),  # Customize y-axis
    plot_bgcolor='white',  # Set the background of the plotting area to white
    bargap=0.1,  # Set the gap between bars
    margin=dict(t=50, b=50, l=50, r=50),  # Add margins
    paper_bgcolor='white',  # Set the background of the surrounding paper to white
    showlegend=False,  # Hide the legend
)

# Show the plot
fig.show()

The distribution also indicates that very high performance scores (17090.4 to 18598.5) are not uncommon, but they are less frequent than mid-range scores. This could indicate a smaller market for high-performance (and potentially higher-cost) products, or a limit to the performance achievable with current technology.
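
The share of observations in each bin can be checked numerically instead of being read off the chart. The sketch below uses `np.histogram` on synthetic scores (the values are an assumption for illustration, not the dataset). Note that `nbins` in `px.histogram` is an upper bound from which Plotly chooses its own bin size, so the edges computed here may not match the plotted bars exactly.

```python
import numpy as np

# Synthetic scores for illustration only; not the GPU dataset
rng = np.random.default_rng(42)
scores = rng.normal(loc=13800, scale=3300, size=500)

# Exactly 20 equal-width bins spanning the observed range
counts, edges = np.histogram(scores, bins=20)
shares = counts / counts.sum()

# Print the range and share of each bin
for left, right, share in zip(edges[:-1], edges[1:], shares):
    print(f"{left:9.1f} to {right:9.1f}: {share:6.1%}")
```

In the notebook, `scores` could be replaced with `df['performance_score'].dropna()`.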




Conclusion¶

The workbook provides a comprehensive analysis of a dataset on graphics processing units (GPUs). The five key insights from the analysis are as follows:

  1. Market Domination: The market is dominated by a handful of brands, namely EVGA, ASUS, MSI, GIGABYTE, and NVIDIA, which account for nearly 80% of all sales. Notably, EVGA leads with nearly 24% of total sales, indicating its strong market position.

  2. Price Variation and Bargain Opportunities: The standard deviation reveals significant price variability within each brand, indicating potential opportunities for buyers to find bargains or deals. Higher-priced brands like EVGA, GIGABYTE, and NVIDIA are associated with higher-performance graphics cards, while brands like EVGA and ASUS are known for better resale value over time.

  3. Relationship between Price and Performance: There is a positive correlation between the price and performance score of GPUs. However, beyond a certain price point, the incremental performance improvement diminishes, suggesting a point of diminishing returns for graphics cards as prices increase.

  4. Sales and Condition of GPUs: The majority of GPU sales on eBay are from used cards. However, brand-new GPUs have a wider price range, indicating greater variation in pricing for new products. Despite being classified as pre-owned, some graphics cards are listed at significantly higher prices due to their rarity or unique features.

  5. Correlation Analysis: 'performance_score' is the variable most strongly correlated with the others in the dataset, showing strong positive correlations with 'price' and 'memory'. In contrast, 'powerPerformance' does not have a strong linear relationship with any of the other variables, suggesting that it might not be a significant factor in the pricing or memory capacity of the products.

These insights can help stakeholders understand the dynamics of the GPU market, including the major brands, the relationship between price and performance, and the influence of variables like 'performance_score' and 'powerPerformance'.

Further Areas of Exploration:¶

In this data analysis Python workbook, there are several areas that warrant further investigation. These include exploring outliers, understanding the power-performance relationship, conducting brand-specific analysis, investigating high price variability, determining the point of diminishing returns, and exploring the pre-owned market. By delving deeper into these aspects, we can gain valuable insights to enhance our understanding of the dataset.
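
As a possible starting point for the diminishing-returns question, listings can be grouped into price bands and the gain in median performance per extra dollar compared from one band to the next. The sketch below runs on synthetic data with an assumed concave price-performance curve, so its numbers say nothing about the real market. The names `demo`, `price_band` and `score_gain_per_dollar` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data with an assumed concave relationship; not the GPU dataset
rng = np.random.default_rng(1)
price = rng.uniform(50, 1000, 400)
score = 4000 * np.log(price) + rng.normal(0, 500, 400)
demo = pd.DataFrame({"price": price, "performance_score": score})

# Split listings into five equally populated price bands
demo["price_band"] = pd.qcut(demo["price"], q=5)

summary = demo.groupby("price_band", observed=True).agg(
    median_price=("price", "median"),
    median_score=("performance_score", "median"),
)

# Extra median score gained per extra dollar when moving up one band
summary["score_gain_per_dollar"] = (
    summary["median_score"].diff() / summary["median_price"].diff()
)
print(summary.round(2))
```

A gain that shrinks steadily across bands would be consistent with diminishing returns. The number of bands is a judgment call, and the result should be checked for sensitivity to it.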